{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 聚类算法在不参考任何标签的情况下，从数据中找出不同的分组。\n",
    "\n",
    "#### 这里使用的算法是高斯混合模型 (Gaussian Mixture Model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 导入Iris数据集\n",
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 导入模型的类\n",
    "from sklearn.mixture import GaussianMixture\n",
    "\n",
    "# 实例化一个模型\n",
    "model = GaussianMixture(n_components=3, covariance_type='full')\n",
    "\n",
    "# 训练模型，注意，并没有提供 y 的值\n",
    "model.fit(iris.data)\n",
    "\n",
    "# 识别聚类的标签\n",
    "ypredict = model.predict(iris.data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 对iris数据做降维操作，以方便做可视化\n",
    "from sklearn.decomposition import PCA\n",
    "model = PCA(n_components=2)\n",
    "model.fit(iris.data)\n",
    "X2d = model.transform(iris.data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 这三个数据的0轴长度都相同\n",
    "ypredict.shape, X2d.shape, iris.target.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 把这三个数据合并成一个DataFrame，方便用seaborn输出\n",
    "df = DataFrame({'pca1': X2d[:,0], 'pca2': X2d[:,1],\n",
    "                'cluster': ypredict, 'species': iris.target})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 用seaborn输出图形\n",
    "import seaborn as sns\n",
    "sns.lmplot(\"pca1\", \"pca2\", data=df, hue='species', col='cluster', fit_reg=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 可以看出，即使没有样本的species标签可用，这些花的特征度量值之间的明显差异，使得通过一个简单的聚类算法就能够发现样本中存在这样的不同分组。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
